Doing some real work this time:
First setting up the plotting defaults and numpy.
In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
In [2]:
%matplotlib inline
plt.rcParams['figure.figsize'] = 6, 4.5
plt.rcParams['axes.grid'] = True
plt.gray()
In [3]:
cd ..
In [4]:
import train
import json
import imp
In [5]:
settings = json.load(open('SETTINGS.json', 'r'))
In [6]:
settings['FEATURES']
Out[6]:
In [8]:
# reducing no. of features for faster prototyping
settings['FEATURES'] = settings['FEATURES'][1:2]
In [9]:
data = train.get_data(settings['FEATURES'])
In [10]:
!free -m
The leaderboard score is apparently a combined ROC AUC over all of the test sets, although I can't find the forum post about this. This shouldn't be too hard to mimic: what we want is a function that will run a single batch of cross-validation for a given model and return results for the various folds.
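Assuming "combined" means that every subject's test predictions are pooled into one list and scored with a single AUC (rather than averaging per-subject AUCs), mimicking it would look roughly like the sketch below; the per_subject dictionary and its contents are made up purely for illustration.
import numpy as np
import sklearn.metrics

# hypothetical per-subject results, purely for illustration:
# subject -> (predicted probabilities, true 0/1 labels)
rng = np.random.RandomState(0)
per_subject = {
    'subject_a': (rng.rand(20), np.array([0]*15 + [1]*5)),
    'subject_b': (rng.rand(30), np.array([0]*24 + [1]*6)),
}

# pool everything and compute one ROC AUC over the lot
predictions = np.hstack([p for p, l in per_subject.values()])
labels = np.hstack([l for p, l in per_subject.values()])
print(sklearn.metrics.roc_auc_score(labels, predictions))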
In [18]:
# getting a set of the subjects involved
subjects = set(list(data.values())[0].keys())
print(subjects)
So the easiest way to do this will be to use scikit-learn's cross_val_score function on a training set for each subject (a rough sketch of that follows below).
It would be best to arrange this as a function so that it can be called with a map for parallelism later, if required.
The first part is a function that will take a model and a subject and return a set of probabilities along with the true labels.
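As an aside, here is a minimal, self-contained sketch of the cross_val_score route; the data and classifier are made up purely for illustration. In the end a custom function is written below instead, because we want the raw probabilities and labels back so they can be pooled across subjects.
import numpy as np
import sklearn.ensemble
import sklearn.cross_validation

# made-up stand-ins for one subject's feature matrix and 0/1 labels
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = np.array([0]*80 + [1]*20)

clf = sklearn.ensemble.RandomForestClassifier(random_state=0)
# one ROC AUC per fold; an integer cv uses stratified folds for classifiers
scores = sklearn.cross_validation.cross_val_score(clf, X, y, scoring='roc_auc', cv=5)
print(scores, scores.mean())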
In [28]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble
import sklearn.cross_validation
from train import utils
In [20]:
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('scl',scaler),('clf',forest)])
In [62]:
def subjpredictions(subject,model,data):
    X,y = utils.build_training(subject,list(data.keys()),data)
    cv = sklearn.cross_validation.StratifiedShuffleSplit(y)
    predictions = []
    labels = []
    for train,test in cv:
        model.fit(X[train],y[train])
        predictions.append(model.predict_proba(X[test]))
        labels.append(y[test])
    predictions = np.vstack(predictions)[:,1]
    labels = np.hstack(labels)
    return predictions,labels
In [68]:
p,l = subjpredictions(list(subjects)[0],model,data)
In [69]:
import sklearn.metrics
In [70]:
sklearn.metrics.roc_auc_score(l,p)
Out[70]:
That seems a little high, especially as we're only running this with one feature and default settings.
In [73]:
fpr,tpr,thresholds = sklearn.metrics.roc_curve(l,p)
In [76]:
plt.plot(fpr,tpr)
Out[76]:
So the performance looks artificially high. To check that this isn't real, I'll use this feature and these settings to predict on the test data and submit it.
In [77]:
features = list(data.keys())
In [79]:
%%time
predictiondict = {}
for subj in subjects:
    # training step
    X,y = utils.build_training(subj,features,data)
    model.fit(X,y)
    # prediction step
    X,segments = utils.build_test(subj,features,data)
    predictions = model.predict_proba(X)
    for segment,prediction in zip(segments,predictions):
        predictiondict[segment] = prediction
In [80]:
import csv
In [81]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%predictiondict[seg][-1]])
Scored 0.52433, so no, it's not some magical classifier.
Anyway, at least that tells us what score the test we're building should be returning in this case. Trying to replicate it by iterating the above function over all subjects.
In [85]:
pls = list(map(lambda s: subjpredictions(s,model,data), subjects))
In [91]:
# this line is going to be problematic
p,l = list(map(np.hstack,list(zip(*pls))))
In [93]:
p.shape
Out[93]:
In [94]:
l.shape
Out[94]:
In [95]:
sklearn.metrics.roc_auc_score(l,p)
Out[95]:
In [96]:
fpr,tpr,thresholds = sklearn.metrics.roc_curve(l,p)
plt.plot(fpr,tpr)
Out[96]:
In [100]:
sklearn.metrics.accuracy_score(l,list(map(int,p)))
Out[100]:
This doesn't make much sense. There must be an error above that's making the classifier appear far more accurate than it could possibly be, presumably by causing massive overfitting.
Ok, so it turns out these problems are probably occurring because I'm not dealing with the unbalanced classes in these datasets.
In each of the subject training sets there are far more zeros than ones, which was something I was going to sort out below using Bayes' theorem.
However, many of scikit-learn's functions take a sample_weight argument for exactly this problem, since it obviously comes up all the time.
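For reference, a minimal standalone sketch of the mechanism used below: Pipeline.fit forwards keyword arguments of the form step__param to the named step, so clf__sample_weight ends up as the random forest's sample_weight. The data here is made up purely for illustration.
import numpy as np
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble

# made-up unbalanced 0/1 labels, purely for illustration
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = np.array([0]*90 + [1]*10)

pipe = sklearn.pipeline.Pipeline([
    ('scl', sklearn.preprocessing.StandardScaler()),
    ('clf', sklearn.ensemble.RandomForestClassifier(random_state=0)),
])

# upweight the rare positive class; the clf__ prefix routes sample_weight
# to the 'clf' step only
weight = len(y)/sum(y)
weights = np.array([weight if i == 1 else 1 for i in y])
pipe.fit(X, y, clf__sample_weight=weights)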
Repeating the above with the proper weightings:
In [123]:
def subjpredictions(subject,model,data):
    X,y = utils.build_training(subject,list(data.keys()),data)
    cv = sklearn.cross_validation.StratifiedShuffleSplit(y)
    predictions = []
    labels = []
    allweights = []
    for train,test in cv:
        # calculate weights
        weight = len(y[train])/sum(y[train])
        weights = np.array([weight if i == 1 else 1 for i in y[train]])
        model.fit(X[train],y[train],clf__sample_weight=weights)
        predictions.append(model.predict_proba(X[test]))
        weight = len(y[test])/sum(y[test])
        weights = np.array([weight if i == 1 else 1 for i in y[test]])
        allweights.append(weights)
        labels.append(y[test])
    predictions = np.vstack(predictions)[:,1]
    labels = np.hstack(labels)
    weights = np.hstack(allweights)
    return predictions,labels,weights
In [136]:
plws = list(map(lambda s: subjpredictions(s,model,data), subjects))
p,l,w = list(map(np.hstack,list(zip(*plws))))
In [141]:
sklearn.metrics.roc_auc_score(l,p,sample_weight=w)
Out[141]:
In [142]:
fpr,tpr,thresholds = sklearn.metrics.roc_curve(l,p,sample_weight=w)
plt.plot(fpr,tpr)
Out[142]:
Looks like that hasn't fixed the problem. Unfortunately, I don't know why.
Checking what effect this has had on the leaderboard position:
In [146]:
predictiondict = {}
for subj in subjects:
    # training step
    X,y = utils.build_training(subj,features,data)
    # weights
    weight = len(y)/sum(y)
    weights = np.array([weight if i == 1 else 1 for i in y])
    model.fit(X,y,clf__sample_weight=weights)
    # prediction step
    X,segments = utils.build_test(subj,features,data)
    predictions = model.predict_proba(X)
    for segment,prediction in zip(segments,predictions):
        predictiondict[segment] = prediction
In [147]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%predictiondict[seg][-1]])
Ok, submitted and it worked: a score of 0.68118, moving 93 places up the leaderboard. Still haven't succeeded in replicating their test, though.
Could compare our AUC score to that of a dummy classifier:
In [149]:
import sklearn.dummy
In [151]:
dmy = sklearn.dummy.DummyClassifier(strategy="most_frequent")
dummy = sklearn.pipeline.Pipeline([('scl',scaler),('clf',dmy)])
In [152]:
plws = list(map(lambda s: subjpredictions(s,dummy,data), subjects))
p,l,w = list(map(np.hstack,list(zip(*plws))))
That call seems to fall over, presumably because the dummy classifier's fit doesn't accept sample weights, so redefining the function with a fallback for models that raise a TypeError when given them:
In [158]:
def subjpredictions(subject,model,data):
    X,y = utils.build_training(subject,list(data.keys()),data)
    cv = sklearn.cross_validation.StratifiedShuffleSplit(y)
    predictions = []
    labels = []
    allweights = []
    for train,test in cv:
        # calculate weights
        try:
            weight = len(y[train])/sum(y[train])
            weights = np.array([weight if i == 1 else 1 for i in y[train]])
            model.fit(X[train],y[train],clf__sample_weight=weights)
        except TypeError:
            # model doesn't support weights
            model.fit(X[train],y[train])
        predictions.append(model.predict_proba(X[test]))
        weight = len(y[test])/sum(y[test])
        weights = np.array([weight if i == 1 else 1 for i in y[test]])
        allweights.append(weights)
        labels.append(y[test])
    predictions = np.vstack(predictions)[:,1]
    labels = np.hstack(labels)
    weights = np.hstack(allweights)
    return predictions,labels,weights
In [159]:
plws = list(map(lambda s: subjpredictions(s,dummy,data), subjects))
p,l,w = list(map(np.hstack,list(zip(*plws))))
In [160]:
sklearn.metrics.roc_auc_score(l,p,sample_weight=w)
Out[160]:
Wait, no, that obviously won't work: the most_frequent dummy gives every clip the same predicted probability, so there's nothing for the AUC to rank against.
Undersampling the majority class instead.
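As an aside, the same undersampling can be written with numpy index arrays; a sketch of the idea, assuming l is the pooled 0/1 label vector and p the matching predicted probabilities (as recomputed in the cells below):
# indices of each class
pos_idx = np.where(l == 1)[0]
neg_idx = np.where(l == 0)[0]

# draw as many negatives as there are positives, without replacement
rng = np.random.RandomState(0)
neg_subsample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
balanced = np.concatenate([pos_idx, neg_subsample])

sklearn.metrics.roc_auc_score(l[balanced], p[balanced])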
In [165]:
plws = list(map(lambda s: subjpredictions(s,model,data), subjects))
p,l,w = list(map(np.hstack,list(zip(*plws))))
In [163]:
import random
In [179]:
# number of positive (preictal = 1) examples
k = sum(l)
# indices of a random subset of the negative class, the same size as the
# positive class, plus all of the positive examples
negatives = random.sample([x for x,i in enumerate(l == 0) if i == True], k)
positives = [x for x,i in enumerate(l == 1) if i == True]
samples = negatives + positives
In [181]:
sklearn.metrics.roc_auc_score(l[samples],p[samples],sample_weight=w[samples])
Out[181]:
Well, that hasn't worked either.
Checking my sampling has actually made sense:
In [183]:
sum(l[samples])/len(l[samples])
Out[183]:
Yeah, the sampling has worked; the classifier is still just performing too well.
Will have to think about this; I can't think of anything else to fix it with.